The primary object of this experiment is to visualize zircon characteristics using UMAP and to choose reasonable parameter values.
library(plotly)
Loading required package: ggplot2
Registered S3 method overwritten by 'data.table':
method from
print.data.table
Attaching package: ‘plotly’
The following object is masked from ‘package:ggplot2’:
last_plot
The following object is masked from ‘package:stats’:
filter
The following object is masked from ‘package:graphics’:
layout
library(here)
Registered S3 methods overwritten by 'dbplyr':
method from
print.tbl_lazy
print.tbl_sql
── Attaching packages ───────────────────────────────────────────────────────────────────────────────── tidyverse 1.3.2 ──✔ tibble 3.1.8 ✔ dplyr 1.1.0
✔ tidyr 1.3.0 ✔ stringr 1.5.0
✔ readr 2.1.3 ✔ forcats 1.0.0
✔ purrr 1.0.1
Attaching package: ‘janitor’
The following objects are masked from ‘package:stats’:
chisq.test, fisher.test
library(fs)
library(rgl)
library(umap)
Read the raw data:
file1=here("data-raw","ME_Geochron_ALL_FILTERED_02.xlsx")
df_r=read_excel(file1,sheet=3) %>% clean_names()
Warning: Expecting numeric in AZ1730 / R1730C52: got 'BDL'New names:
Extract a subset of the columns that are relevant. We consider this as the “working” dataframe.
df_w = df_r %>% rowid_to_column("zircon") %>%
select(zircon,1:geological_domain,class:geochron_id,
final_age207_206:final_age207_206_prop2se,p:u_2se)
We also create a dataframe focuses specifically of the measure variables:
df_m=df_w %>% select(-contains("2se")) %>% select(zircon:station_id,final_age207_206:u ) %>%
mutate(across(final_age207_206:u,as.numeric))
Also, we pull a dataframe focused on meta-varialbes:
df_meta=df_w %>% select(zircon:rock_type)
Visualize for missing data:
naniar::vis_miss(df_m)
We will start by using only the dataset of complete data:
df_comp=df_m %>% drop_na()
To keep the problem small (for initial testing), we will use only a subsample of those points.
df_100=df_comp %>% slice_sample(prop = .5)
Separate numeric data from label.
umap_data=df_100 %>% select(-zircon,-station_id) %>%
mutate(across(.fns=scale))
Warning: There was 1 warning in `mutate()`.
ℹ In argument: `across(.fns = scale)`.
Caused by warning:
! Using `across()` without supplying `.cols` was deprecated in dplyr 1.1.0.
ℹ Please supply `.cols` instead.
This warning is displayed once every 8 hours.
Call `lifecycle::last_lifecycle_warnings()` to see where this warning was generated.
zirc_list_100=df_100 %>% select(zircon) %>% pull()
Fit the umap model.
umap_100=umap(umap_data)
my_plot=function(umap,zirc_list,var,facet=TRUE) {
if (facet) {
df1=as_tibble(umap$layout) %>%
mutate(zircon=zirc_list) %>%
left_join(df_meta)
p=df1 %>% ggplot(aes(V1,V2,color={{var}})) +geom_point()+
facet_wrap(vars({{var}}))+
gghighlight()+theme_minimal()
print(p)
}
p1=df1 %>% ggplot(aes(V1,V2,color={{var}})) +geom_point()+theme_minimal()
print(p1)
df1
}
df_out=my_plot(umap_100,zirc_list_100,var=station_id)
Warning: The `x` argument of `as_tibble.matrix()` must have unique column names if `.name_repair` is omitted as of tibble 2.0.0.
Using compatibility `.name_repair`.Joining with `by = join_by(zircon)`label_key: station_id
Too many data points, skip labeling
Fit the umap model.
Change mindist
umap_100=umap(umap_data,min_dist=0.0001,n_neighbors=10,metric="manhattan")
df_out=my_plot(umap_100,zirc_list_100,var=station_id)
Joining with `by = join_by(zircon)`label_key: station_id
Too many data points, skip labeling
We elimate the points for which we dont have multivariate data (beyond 1730).
df_2 =df_m %>% slice_head(n=1730)
naniar::vis_miss(df_2)
df_v=df_2 %>% replace(is.na(.), 0)
Let’s revisualize, working up slowly to the full dataset:
df_v2=df_v %>% slice_sample(prop=.8)
umap_data=df_v2 %>% select(-zircon,-station_id) %>%
mutate(across(.fns=scales::rescale))
zirclist= df_v2 %>% select(zircon) %>% pull()
umap_v=umap(umap_data,min_dist=0.0001,n_neighbors=10,metric="manhattan")
df_out=my_plot(umap_v,zirclist,var=station_id)
Joining with `by = join_by(zircon)`label_key: station_id
Too many data points, skip labeling
df_v=df_2 %>% select(1:4,ti:sr,nb:pr,sm:gd,yb,hf,th,u)
naniar::vis_miss(df_v)
Let’s revisualize, working up slowly to the full dataset:
df_v2=df_v %>% drop_na() %>% slice_sample(prop=1)
umap_data=df_v2 %>% select(-zircon,-station_id) %>%
mutate(across(.fns=scales::rescale))
zirclist= df_v2 %>% select(zircon) %>% pull()
umap_v=umap(umap_data,min_dist=0.0001,n_neighbors=30,metric="manhattan")
df_out=my_plot(umap_v,zirclist,var=station_id)
Joining with `by = join_by(zircon)`label_key: station_id
Too many data points, skip labeling
df_out=my_plot(umap_v,zirclist,var=zone)
Joining with `by = join_by(zircon)`
df_out=my_plot(umap_v,zirclist,var=geological_domain)
Joining with `by = join_by(zircon)`label_key: geological_domain
Too many data points, skip labeling
df_out=my_plot(umap_v,zirclist,var=rock_type)
Joining with `by = join_by(zircon)`label_key: rock_type
Too many data points, skip labeling
umap_v3=umap(umap_data,min_dist=0.0001,n_neighbors=30,metric="manhattan",n_components=3)
# umap_labels=df_v2 %>% select(zircon) %>%
# left_join(df_meta) %>% pull(rock_type)
umap_labels=df_v2 %>% select(zircon) %>%
left_join(df_meta)
Joining with `by = join_by(zircon)`
df3=as_tibble(umap_v3$layout) %>%
bind_cols(umap_labels)
#df1 %>% ggplot(aes(V1,V2,color=g)) +geom_point()+theme_minimal()
df3 %>% ggplot(aes(V1,V2,color=geological_domain)) +geom_point()+theme_minimal()+
facet_wrap(~station_id)
NA
NA
plot_ly(df3, x = ~V1, y = ~V2, z = ~V3, color = ~geological_domain,size=3)
No trace type specified:
Based on info supplied, a 'scatter3d' trace seems appropriate.
Read more about this trace type -> https://plotly.com/r/reference/#scatter3d
No scatter3d mode specifed:
Setting the mode to markers
Read more about this attribute -> https://plotly.com/r/reference/#scatter-mode
No trace type specified:
Based on info supplied, a 'scatter3d' trace seems appropriate.
Read more about this trace type -> https://plotly.com/r/reference/#scatter3d
No scatter3d mode specifed:
Setting the mode to markers
Read more about this attribute -> https://plotly.com/r/reference/#scatter-mode
Lets check for correlations:
library(GGally)
Registered S3 method overwritten by 'GGally':
method from
+.gg ggplot2
pairs(umap_data %>% select(8:13))
What does PCA reveal?
my_pca <- prcomp(umap_data)
summary(my_pca)
Importance of components:
PC1 PC2 PC3 PC4 PC5 PC6 PC7 PC8 PC9 PC10 PC11 PC12
Standard deviation 0.3047 0.1380 0.13471 0.12455 0.1139 0.10221 0.08158 0.07112 0.05168 0.04513 0.03353 0.0253
Proportion of Variance 0.4935 0.1012 0.09649 0.08247 0.0690 0.05555 0.03539 0.02689 0.01420 0.01083 0.00598 0.0034
Cumulative Proportion 0.4935 0.5947 0.69118 0.77365 0.8427 0.89820 0.93359 0.96048 0.97468 0.98551 0.99149 0.9949
PC13 PC14 PC15
Standard deviation 0.02123 0.01847 0.0130
Proportion of Variance 0.00240 0.00181 0.0009
Cumulative Proportion 0.99729 0.99910 1.0000
10 principal compents is sufficient.
umap_p=umap(my_pca$x[,1:10],min_dist=0.0001,n_neighbors=30,metric="manhattan")
df_out=my_plot(umap_p,zirclist,var=station_id)
Joining with `by = join_by(zircon)`label_key: station_id
Too many data points, skip labeling
PCA seems to not be a good choice.
df1=as_tibble(umap_v$layout) %>%
mutate(zircon=zirclist) %>%
left_join(df_meta)
Joining with `by = join_by(zircon)`
df1 %>% ggplot(aes(V1,V2,color=class)) +geom_point()+
facet_wrap(vars(zone,station_id))+
theme_minimal()
NA